Goto

Collaborating Authors

 Huánuco Department


A Appendix

Neural Information Processing Systems

The complete list may be seen in Table 8. Here are a few general notes about these strings: 1. Based on their recommendations, we did the following: 1. zh, zh_Latn: This resulted in the special filters described below. URLs) the corpora were in languages different from the LangID predictions. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.



Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages

Omnilingual ASR team, null, Keren, Gil, Kozhevnikov, Artyom, Meng, Yen, Ropers, Christophe, Setzler, Matthew, Wang, Skyler, Adebara, Ife, Auli, Michael, Balioglu, Can, Chan, Kevin, Cheng, Chierh, Chuang, Joe, Droof, Caley, Duppenthaler, Mark, Duquenne, Paul-Ambroise, Erben, Alexander, Gao, Cynthia, Gonzalez, Gabriel Mejia, Lyu, Kehan, Miglani, Sagar, Pratap, Vineel, Sadagopan, Kaushik Ram, Saleem, Safiyyah, Turkatenko, Arina, Ventayol-Boada, Albert, Yong, Zheng-Xin, Chung, Yu-An, Maillard, Jean, Moritz, Rashel, Mourachko, Alexandre, Williamson, Mary, Yates, Shireen

arXiv.org Artificial Intelligence

Automatic speech recognition (ASR) has advanced in high-resource languages, but most of the world's 7,000+ languages remain unsupported, leaving thousands of long-tail languages behind. Expanding ASR coverage has been costly and limited by architectures that restrict language support, making extension inaccessible to most--all while entangled with ethical concerns when pursued without community collaboration. To transcend these limitations, we introduce Omnilingual ASR, the first large-scale ASR system designed for extensibility. Omnilingual ASR enables communities to introduce unserved languages with only a handful of data samples. It scales self-supervised pre-training to 7B parameters to learn robust speech representations and introduces an encoder-decoder architecture designed for zero-shot generalization, leveraging a LLM-inspired decoder. This capability is grounded in a massive and diverse training corpus; by combining breadth of coverage with linguistic variety, the model learns representations robust enough to adapt to unseen languages. Incorporating public resources with community-sourced recordings gathered through compensated local partnerships, Omnilingual ASR expands coverage to over 1,600 languages, the largest such effort to date--including over 500 never before served by ASR. Automatic evaluations show substantial gains over prior systems, especially in low-resource conditions, and strong generalization. We release Omnilingual ASR as a family of models, from 300M variants for low-power devices to 7B for maximum accuracy. We reflect on the ethical considerations shaping this design and conclude by discussing its societal impact. In particular, we highlight how open-sourcing models and tools can lower barriers for researchers and communities, inviting new forms of participation. Open-source artifacts are available at https://github.com/facebookresearch/omnilingual-asr.




A Appendix A.1 LangID Details

Neural Information Processing Systems

The complete list may be seen in Table 8. Here are a few general notes about these strings: 1. Based on their recommendations, we did the following: 1. zh, zh_Latn: This resulted in the special filters described below. URLs) the corpora were in languages different from the LangID predictions. This is mainly mis-rendered PDFs and may have practical applications for denoising, or for decoding such garbled PDFs.


Type-Less yet Type-Aware Inductive Link Prediction with Pretrained Language Models

De Bellis, Alessandro, Bufi, Salvatore, Servedio, Giovanni, Anelli, Vito Walter, Di Noia, Tommaso, Di Sciascio, Eugenio

arXiv.org Artificial Intelligence

Inductive link prediction is emerging as a key paradigm for real-world knowledge graphs (KGs), where new entities frequently appear and models must generalize to them without retraining. Predicting links in a KG faces the challenge of guessing previously unseen entities by leveraging generalizable node features such as subgraph structure, type annotations, and ontological constraints. However, explicit type information is often lacking or incomplete. Even when available, type information in most KGs is often coarse-grained, sparse, and prone to errors due to human annotation. In this work, we explore the potential of pre-trained language models (PLMs) to enrich node representations with implicit type signals. We introduce TyleR, a Type-less yet type-awaRe approach for subgraph-based inductive link prediction that leverages PLMs for semantic enrichment. Experiments on standard benchmarks demonstrate that TyleR outperforms state-of-the-art baselines in scenarios with scarce type annotations and sparse graph connectivity. To ensure reproducibility, we share our code at https://github.com/sisinflab/tyler .


Private LoRA Fine-tuning of Open-Source LLMs with Homomorphic Encryption

Frery, Jordan, Bredehoft, Roman, Klemsa, Jakub, Meyre, Arthur, Stoian, Andrei

arXiv.org Artificial Intelligence

Preserving data confidentiality during the fine-tuning of open-source Large Language Models (LLMs) is crucial for sensitive applications. This work introduces an interactive protocol adapting the Low-Rank Adaptation (LoRA) technique for private fine-tuning. Homomorphic Encryption (HE) protects the confidentiality of training data and gradients handled by remote worker nodes performing the bulk of computations involving the base model weights. The data owner orchestrates training, requiring minimal local computing power and memory, thus alleviating the need for expensive client-side GPUs. We demonstrate feasibility by fine-tuning a Llama-3.2-1B model, presenting convergence results using HE-compatible quantization and performance benchmarks for HE computations on GPU hardware. This approach enables applications such as confidential knowledge base question answering, private codebase fine-tuning for AI code assistants, AI agents for drafting emails based on a company's email archive, and adapting models to analyze sensitive legal or healthcare documents.


A Grounded Typology of Word Classes

Haley, Coleman, Goldwater, Sharon, Ponti, Edoardo

arXiv.org Artificial Intelligence

We propose a grounded approach to meaning in language typology. We treat data from perceptual modalities, such as images, as a language-agnostic representation of meaning. Hence, we can quantify the function--form relationship between images and captions across languages. Inspired by information theory, we define "groundedness", an empirical measure of contextual semantic contentfulness (formulated as a difference in surprisal) which can be computed with multilingual multimodal language models. As a proof of concept, we apply this measure to the typology of word classes. Our measure captures the contentfulness asymmetry between functional (grammatical) and lexical (content) classes across languages, but contradicts the view that functional classes do not convey content. Moreover, we find universal trends in the hierarchy of groundedness (e.g., nouns > adjectives > verbs), and show that our measure partly correlates with psycholinguistic concreteness norms in English. We release a dataset of groundedness scores for 30 languages. Our results suggest that the grounded typology approach can provide quantitative evidence about semantic function in language.


Optimization of Energy Consumption Forecasting in Puno using Parallel Computing and ARIMA Models: An Innovative Approach to Big Data Processing

Vilca-Tinta, Cliver W., Torres-Cruz, Fred, Quispe-Morales, Josefh J.

arXiv.org Machine Learning

This research presents an innovative use of parallel computing with the ARIMA (AutoRegressive Integrated Moving Average) model to forecast energy consumption in Peru's Puno region. The study conducts a thorough and multifaceted analysis, focusing on the execution speed, prediction accuracy, and scalability of both sequential and parallel implementations. A significant emphasis is placed on efficiently managing large datasets. The findings demonstrate notable improvements in computational efficiency and data processing capabilities through the parallel approach, all while maintaining the accuracy and integrity of predictions. This new method provides a versatile and reliable solution for real-time predictive analysis and enhances energy resource management, which is particularly crucial for developing areas. In addition to highlighting the technical advantages of parallel computing in this field, the study explores its practical impacts on energy planning and sustainable development in regions like Puno.